Classification of compounds based on their SMILES representation

qHTS for Inhibitors of human tyrosyl-DNA phosphodiesterase 1 (TDP1): qHTS in cells in absence of CPT

Stages 1 and 2

Carla Rafaela Silva, pg42862; José Pereira, pg42871; Tiago Silva, pg42885.

Introduction

Human tyrosyl-DNA phosphodiesterase 1 (TDP1) is a novel repair gene, and we propose to use it as a new target for anticancer drug development. TDP1 is not an essential protein, but under treatment with a topoisomerase I poison (camptothecin, CPT), TDP1 becomes a critical factor for cell survival. To directly identify novel TDP1 inhibitors active in a cellular environment, we have knocked out the Tdp1 gene in chicken DT40 cells (Tdp1-/-) and generated complemented counterpart cells that contain a stable transfection of the human TDP1 gene (Tdp1-/-;hTDP1 cells). For the primary screen, Tdp1-/-;hTDP1 cells will be exposed to small molecules in the presence or absence of CPT, and their growth kinetics will be evaluated after 48 hours by measuring ATP activity. If a given compound shows a synergistic effect with CPT, this compound could inhibit the repair pathway of CPT-induced lesions, including the TDP1-mediated repair pathway. The hit compounds will then be evaluated in the presence or absence of CPT using Tdp1-/- cells. If a compound shows a synergistic effect with CPT in Tdp1-/-;hTDP1 cells, but not in Tdp1-/- cells, the compound could be involved in inhibition of the TDP1-mediated repair pathway. In tertiary assays, biochemical gel-based assays will be used to assess whether the hit compounds specifically target TDP1.

Imports

Initial exploration

Import dataset

The first step in analyzing this dataset is to load and display the TDP1 data.

Simple Analyses

The next step examines how the data is distributed across the rows and columns of the dataset.

This dataset was loaded under the name 'dataset'. It has 40,000 distinct molecules and 48 variables. In total, there are 1,920,000 data entries.

Column name: Description
PUBCHEM_RESULT_TAG: An increasing number starting from one.
PUBCHEM_SID: PubChem Substance ID.
PUBCHEM_CID: PubChem Compound ID.
PUBCHEM_ACTIVITY_OUTCOME: Activity outcome as a value, where 0 indicates inactive and 1 indicates active.
PUBCHEM_ACTIVITY_SCORE: A normalized activity score between 0 and 100, where the most active result rows score closer to 100 and inactive ones closer to 0, so that results can be ranked and hits prioritized.
PUBCHEM_ACTIVITY_URL: An optional URL for the assay data reported for this substance.
PUBCHEM_ASSAYDATA_COMMENT: Textual annotation and comments.
Potency: Concentration at which the compound exhibits half-maximal efficacy.
Efficacy: Maximal efficacy of the compound, reported as a percentage of control.
Analysis Comment: Annotation or notes on a particular compound's data or its analysis.
Activity_Score: Activity score.
Curve_Description: A description of dose-response curve quality.
Fit_LogAC50: The logarithm of the AC50 from a fit of the data to the Hill equation (calculated in molar units).
Fit_HillSlope: The Hill slope from a fit of the data to the Hill equation.
Fit_R2: R^2 fit value of the curve. Values closer to 1.0 indicate a better Hill equation fit.
Fit_InfiniteActivity: The asymptotic efficacy from a fit of the data to the Hill equation.
Fit_ZeroActivity: Efficacy at zero concentration of compound from a fit of the data to the Hill equation.
Fit_CurveClass: Numerical encoding of the curve description for the fitted Hill equation.
Excluded_Points: The dose-response titration points that were excluded from analysis based on outlier analysis.
Max_Response: Maximum activity observed for the compound (usually at the highest concentration tested).
Activity at xx uM*: Percent activity at the given concentration.
Compound QC: NCGC designation for the data stage: 'qHTS', 'qHTS Verification', or 'Secondary Profiling'.
smiles: SMILES (Simplified Molecular Input Line Entry System), a chemical notation that represents a chemical structure in a form that a computer can process.

*Activity at xx uM refers to all columns that show the activity of a molecule at a given concentration.
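The Fit_ columns listed above are parameters of a fitted Hill curve. As a minimal sketch, assuming the common four-parameter form of the Hill equation (the exact parameterization used by the assay's curve-fitting software is not stated in this notebook), the response at a given concentration could be evaluated as shown below. The function name, argument names, and example values are hypothetical.

```python
def hill_response(conc_molar, log_ac50, hill_slope, zero_activity, infinite_activity):
    """Four-parameter Hill curve (assumed form, not confirmed by the assay description).

    conc_molar        -- compound concentration in mol/L (must be > 0)
    log_ac50          -- log10 of the AC50 in mol/L (cf. Fit_LogAC50)
    hill_slope        -- Hill slope (cf. Fit_HillSlope)
    zero_activity     -- efficacy at zero concentration (cf. Fit_ZeroActivity)
    infinite_activity -- asymptotic efficacy (cf. Fit_InfiniteActivity)
    """
    ac50 = 10.0 ** log_ac50
    span = infinite_activity - zero_activity
    return zero_activity + span / (1.0 + (ac50 / conc_molar) ** hill_slope)


# Hypothetical inhibitor: AC50 = 1 uM, slope 1, response going from 0% to -80%.
half_max = hill_response(1e-6, -6.0, 1.0, 0.0, -80.0)
near_plateau = hill_response(1.0, -6.0, 1.0, 0.0, -80.0)
```

Under this form, the response at the AC50 lies halfway between the zero-concentration and asymptotic efficacies, which matches the definition of Potency given in the table.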

Pre-Processing

The number of missing values (NAs) in each column is counted.

Visualization of the NA's

As we can see, a few columns consist entirely of NAs, such as "PUBCHEM_ASSAYDATA_COMMENT" and "Analysis Comment". These columns therefore add no information to the dataset. It is important to note that 10 molecules have a missing SMILES string.

We can observe that more than 50% of all data entries are NAs.
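The NA counts discussed above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the notebook's original code: the tiny in-memory CSV is hypothetical, and empty cells stand in for NAs. With pandas, DataFrame.isna().sum() would give the same per-column counts.

```python
import csv
import io

# Hypothetical stand-in for the real file; empty cells play the role of NAs.
raw = io.StringIO(
    "PUBCHEM_CID,PUBCHEM_ASSAYDATA_COMMENT,Potency,smiles\n"
    "1,,12.5,CCO\n"
    "2,,,c1ccccc1\n"
    "3,,7.9,\n"
)
rows = list(csv.DictReader(raw))
columns = list(rows[0].keys())

# Number of missing values per column.
na_per_column = {col: sum(1 for row in rows if row[col] == "") for col in columns}

# Columns that contain nothing but missing values.
all_na_columns = [col for col, n in na_per_column.items() if n == len(rows)]

# Overall fraction of missing entries in the table.
na_fraction = sum(na_per_column.values()) / (len(rows) * len(columns))
```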

Drop specific features

Three columns consisting only of NAs were removed, reducing the dataset to 45 columns. Columns that would not be useful for further analysis were also removed: specifically, "PUBCHEM_ACTIVITY_URL" and "Compound QC", which brought the total down to 43 columns. The 10 molecules without a SMILES string were removed from the dataset.

To help with future analysis, the "PUBCHEM_ACTIVITY_OUTCOME" categorical variable was transformed into a binary variable.
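The clean-up steps of this section could look roughly like the sketch below. It assumes that the outcome labels are the strings "active" and "inactive", which this notebook does not confirm. The records and the helper name clean are hypothetical.

```python
# Columns named in the text as not useful for further analysis.
DROP_COLUMNS = {"PUBCHEM_ACTIVITY_URL", "Compound QC"}

# Assumed label spelling; adjust to whatever the raw column actually contains.
OUTCOME_TO_BINARY = {"inactive": 0, "active": 1}


def clean(records):
    """Drop unwanted columns, skip rows without SMILES, binarize the outcome."""
    cleaned = []
    for rec in records:
        if not rec.get("smiles"):
            continue  # molecule has no SMILES string
        row = {key: value for key, value in rec.items() if key not in DROP_COLUMNS}
        label = row["PUBCHEM_ACTIVITY_OUTCOME"].lower()
        row["PUBCHEM_ACTIVITY_OUTCOME"] = OUTCOME_TO_BINARY[label]
        cleaned.append(row)
    return cleaned


# Hypothetical records standing in for rows of the dataset.
records = [
    {"PUBCHEM_CID": "1", "Compound QC": "qHTS", "PUBCHEM_ACTIVITY_OUTCOME": "active", "smiles": "CCO"},
    {"PUBCHEM_CID": "2", "Compound QC": "qHTS", "PUBCHEM_ACTIVITY_OUTCOME": "inactive", "smiles": "c1ccccc1"},
    {"PUBCHEM_CID": "3", "Compound QC": "qHTS", "PUBCHEM_ACTIVITY_OUTCOME": "inactive", "smiles": ""},
]
cleaned = clean(records)
```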

Graphic Exploration

Activity_outcome and Phenotype

As the "PUBCHEM_ACTIVITY_OUTCOME" pie chart shows, the data is balanced for binary classification. The "Phenotype" pie chart shows that the multiclass distribution is imbalanced overall, although the 'Inactive' and 'Inhibitor' phenotypes are balanced with respect to each other.

Boxplots of Activity at 0.00299 uM, 0.363 uM, 1.849 uM, 9.037 uM and 46.23 uM

Compound Standardization

In this step, we standardize the molecules, with operations ranging from isotope removal to stereochemistry removal. The standardization is done in the following order:

Feature Generation

This step is divided into two parts: molecular descriptors and molecular fingerprints. A molecular descriptor is the final result of a logical and mathematical procedure that transforms chemical information encoded within a symbolic representation of a molecule into a useful number or the result of some standardized experiment [1]. Examples include molecular weight, polar surface area, number of rings, and number of aromatic rings.

Molecular Fingerprints are a way of encoding the structure of a molecule. The most common type of fingerprint is a series of binary digits (bits) that represent the presence or absence of particular substructures in the molecule. Comparing fingerprints allows you to determine the similarity between two molecules.
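As a small illustration of comparing fingerprints, the sketch below computes the Tanimoto (Jaccard) coefficient, a commonly used similarity measure for binary fingerprints. The choice of Tanimoto is an assumption made for this example, since the notebook does not name a similarity measure. The fingerprints are hypothetical and are represented as sets of "on" bit positions.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two sets of 'on' bit positions."""
    a, b = set(bits_a), set(bits_b)
    if not a and not b:
        return 1.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)


# Hypothetical fingerprints: 3 shared bits out of 6 distinct bits overall.
fp_a = {1, 5, 9, 42}
fp_b = {1, 5, 10, 42, 77}
similarity = tanimoto(fp_a, fp_b)
```

A value of 1.0 means the two fingerprints set exactly the same bits, and 0.0 means they share none.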

[1] Todeschini, R., Consonni, V. (2000). Handbook of Molecular Descriptors. Methods and Principles in Medicinal Chemistry. Wiley. doi:10.1002/9783527613106.

Molecular Descriptors

Create dataframe with feature names

As we can see, after generating the molecular descriptors, we ended up with 208 features.

We selected four of these descriptors to examine their distributions further. ExactMolWt is the exact (monoisotopic) molecular weight of the molecule. NumAromaticRings counts the aromatic rings. RingCount counts all rings. TPSA, the topological polar surface area, is the polar surface area of the molecule.

Comparing both box plots, we can observe that the median molecular weight is slightly higher for the active molecules.

Comparing both box plots, we can observe that the median ring count is slightly higher for the active molecules.

Comparing both box plots, we can observe that the median number of aromatic rings is slightly higher for the active molecules.

Unlike the previous results, the active molecules have a slightly lower median topological polar surface area than the inactive ones.

Normalize Data

Molecular Fingerprints

We are going to study three different ways of constructing fingerprints: MorganFingerprint, RDKFingerprint, and MACCSkeysFingerprint.

Both Morgan and RDK fingerprint techniques produced 2048 features while MACCSkeys produced only 167 features.

Feature Selection

Variance measures the spread of the values of a variable: it is the average squared distance of each value from the mean. A feature with very low variance takes nearly the same value for every molecule and therefore contributes little to predicting the response, although high variance alone does not guarantee relevance. To select the most informative features, we applied the Boruta algorithm to the molecular descriptors and kept the highest-ranking 10% of the features of the molecular fingerprints.
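As a small illustration of the variance definition above, the sketch below computes the population variance of a few hypothetical feature columns and flags the constant ones. It shows only the spread computation and is not the Boruta or SelectPercentile procedure. The feature names and values are made up.

```python
from statistics import pvariance

# Hypothetical feature columns (one list of values per feature).
features = {
    "feature_a": [0.0, 0.0, 0.0, 0.0],  # constant, so zero variance
    "feature_b": [1.0, 2.0, 3.0, 4.0],
    "feature_c": [0.0, 1.0, 0.0, 1.0],
}

# Population variance: mean squared distance of each value from the mean.
variances = {name: pvariance(values) for name, values in features.items()}

# A constant feature cannot help separate active from inactive molecules.
constant_features = [name for name, var in variances.items() if var == 0.0]
```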

We chose the Boruta algorithm because it follows an all-relevant selection approach: it keeps every feature that is relevant to the outcome variable. Most other feature selection algorithms instead follow a minimal-optimal approach, relying on a small subset of features that yields a minimal error on a chosen classifier. We did not use Boruta for the fingerprints because it did not work with that type of data. Instead, we chose SelectPercentile.
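The idea of keeping the top 10% of features by score can be sketched as follows. This is a simplified stand-in, not scikit-learn's SelectPercentile implementation. The scores are hypothetical (the scoring function itself is outside the sketch), and rounding the number of kept features upward is an assumption chosen because it reproduces the counts reported in this notebook (205 of 2048 and 17 of 167).

```python
import math


def select_top_percentile(scores, percentile=10):
    """Return the indices of the top `percentile` percent of features by score."""
    n_keep = math.ceil(len(scores) * percentile / 100)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


# Hypothetical relevance scores for 10 features; keep the best 20%.
scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.05, 0.4, 0.6, 0.7, 0.5]
kept = select_top_percentile(scores, percentile=20)

# Number of features kept at 10% for the fingerprint sizes used here.
n_kept_2048 = len(select_top_percentile(list(range(2048))))
n_kept_167 = len(select_top_percentile(list(range(167))))
```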

Molecular Descriptors

After feature selection, the number of features was reduced by more than half, dropping from 208 to 82. ExactMolWt, NumAromaticRings, RingCount, and TPSA were all retained.

Molecular Fingerprints

After feature selection, the number of features was reduced from 2048 to 205 for both the Morgan and RDK fingerprints. The MACCSkeysFingerprint was reduced from 167 to 17 features.

Unsupervised exploration

Principal Component Analysis (PCA) is a dimension-reduction tool and a statistical procedure that can reduce a large set of variables to a small set that still contains most of the information of the larger set. It uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components (PC). This procedure can explain the variance-covariance structure of the data.

t-distributed Stochastic Neighbor Embedding (t-SNE) is a tool to visualize high-dimensional data. It converts similarities between data points to joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data.
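The quantity that t-SNE minimizes, the Kullback-Leibler divergence, can be illustrated on small discrete distributions. The sketch below shows only the divergence itself, not the t-SNE optimization. The distributions are hypothetical, and q is assumed to be positive wherever p is positive.

```python
import math


def kl_divergence(p, q):
    """KL(P || Q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical "high-dimensional" similarities and two candidate embeddings.
p = [0.5, 0.3, 0.2]
q_close = [0.45, 0.35, 0.2]  # resembles p, so the divergence is small
q_far = [0.1, 0.1, 0.8]      # differs strongly from p, so the divergence is large

kl_same = kl_divergence(p, p)
kl_close = kl_divergence(p, q_close)
kl_far = kl_divergence(p, q_far)
```

An embedding whose pairwise similarities match those of the original data gives a divergence of zero, and the value grows as the two distributions diverge.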

k-means clustering is a method of vector quantization, originally from signal processing, that aims to partition N observations into k clusters. Each observation belongs to the cluster with the nearest mean (cluster centroid), serving as a cluster prototype.
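A minimal sketch of the k-means procedure described above, using Lloyd's algorithm on two-dimensional points. It is a pure-Python illustration with hypothetical data and explicitly supplied starting centroids, not the implementation used in this notebook.

```python
def kmeans(points, centroids, n_iter=10):
    """Lloyd's algorithm on 2-D points, starting from the given centroids."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point goes to the nearest centroid.
        labels = [
            min(range(len(centroids)),
                key=lambda k: (x - centroids[k][0]) ** 2 + (y - centroids[k][1]) ** 2)
            for x, y in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append((sum(p[0] for p in members) / len(members),
                                      sum(p[1] for p in members) / len(members)))
            else:
                new_centroids.append(old)  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return labels, centroids


# Two hypothetical, well-separated groups of points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
```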

Descriptors

Principal Component Analysis (PCA)

The first two principal components explain 45% of the data variance. To explain 95% of the variance, 29 principal components are required.
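The two figures quoted above, the variance explained by the first two components and the number of components needed to reach 95%, can be derived from the per-component explained variance ratios. The sketch below assumes those ratios are already available from a fitted PCA. The ratios shown are hypothetical and are not the values of this dataset.

```python
from itertools import accumulate


def components_needed(explained_variance_ratio, target=0.95):
    """Smallest number of leading components whose cumulative ratio reaches target."""
    for n, cumulative in enumerate(accumulate(explained_variance_ratio), start=1):
        if cumulative >= target - 1e-12:  # small tolerance for float rounding
            return n
    return None  # target not reachable with the given components


# Hypothetical ratios, sorted in decreasing order as PCA returns them.
ratios = [0.50, 0.25, 0.12, 0.06, 0.04, 0.03]
first_two = sum(ratios[:2])
n_95 = components_needed(ratios)
```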

In general, it is difficult to distinguish between active and inactive molecules. The PCA graph shows that our data does not separate well along the first principal component, which explains 34% of the variance. There is, however, a separation along the second principal component, which explains 10% of the variance.

t-distributed Stochastic Neighbor Embedding (t-SNE)

From the t-SNE graph, we observe a small separation of the data along the first dimension.

k-Means

According to the k-Means graph, there is no clear separation between the clusters.

Fingerprints

MorganFingerprint

Principal Component Analysis (PCA)

The first two principal components explain 44% of the data variance. To explain 95% of the variance, 34 principal components are required.

The PCA graph shows that our data does not separate well along the first two principal components, which explain 44% of the variance. In general, it is difficult to distinguish between active and inactive molecules.

t-distributed Stochastic Neighbor Embedding (t-SNE)

From the t-SNE graph, we observe no clear separation of the data along either dimension.

k-Means

According to the k-Means graph, there is no clear separation between the clusters.

RDKFingerprint

Principal Component Analysis (PCA)

The first two principal components explain 88% of the data variance. To explain 95% of the variance, 12 principal components are required.

The PCA graph shows that our data has a slight separation along the first principal component, which explains 83% of the variance. There is also a small separation of the data along the second principal component, which explains 4% of the variance. Even though the first two principal components explain 88% of the variance, it is still difficult to distinguish the molecules according to their activity.

t-distributed Stochastic Neighbor Embedding (t-SNE)

From the t-SNE graph, we observe a slight separation along the first dimension.

k-Means

According to the k-Means graph, there is a clear separation between the clusters.

MACCSkeysFingerprint

Principal Component Analysis (PCA)

The first two principal components explain 64% of the data variance. To explain 95% of the variance, 9 principal components are required.

The PCA graph shows that our data does not separate well along the first two principal components, which explain 64% of the variance. In general, it is difficult to distinguish between active and inactive molecules.

t-distributed Stochastic Neighbor Embedding (t-SNE)

From the t-SNE graph, we observe no separation of the data along either of the two dimensions.

k-Means

According to the k-Means graph, there is no clear separation between the clusters.

Conclusion

From the TDP1 activity dataset, we extracted the SMILES string and the activity of every molecule. From the SMILES, we derived two types of features, descriptors and fingerprints, and examined how they relate to the activity of the corresponding molecule using PCA, t-SNE, and clustering. These analyses were inconclusive: it is difficult to distinguish the molecules according to their activity. Nonetheless, we think it is possible to proceed to supervised learning using the descriptors and the RDKFingerprint technique, which achieved the best results among the fingerprints. We expect supervised learning to give better results in classifying the activity of the molecules.